Aurélien Goutsmedt, Thomas Laloux and Marine Bardou (UCLouvain)
2024-10-31
1 Training Goals
Motivations
First Session:
Giving a basic understanding of what web scraping is and what it can do
Discussing ethical (and legal) issues linked to web scraping
Proposing a roadmap for practicing web scraping with R
Providing bits of code and practical tips
Second Session:
Hands-on practice with different exercises by level of difficulty
Prerequisites
You will need R and RStudio for the second session (please make sure to install them!)
These slides are built from a .qmd (quarto) document \(\Rightarrow\) all the code used in these slides can be run in RStudio
# These lines of code have to be run first if you want to install all the packages directly
# pacman will be used to install (if necessary) and load packages

# We install pacman if it is not already installed
if (length(grep("pacman", installed.packages())) == 0) install.packages("pacman")
library(pacman)

# Installing the needed packages in advance
p_load(tidyverse, # basic suite of packages
       glue, # useful for building strings (notably for URLs)
       scico, # color palettes
       patchwork, # for juxtaposition of graphs
       DT) # to display html tables
2 What is Web Scraping?
What is web scraping?
Web scraping is a method for extracting data available on the World Wide Web
The World Wide Web, or “Web”, is a network of websites (online documents coded in html and css)
A web scraper is a program, for instance in R, that automatically reads the HTML structure of a website and extracts the relevant content (text, hypertext references, tables)
No need to fully understand HTML and CSS
Useful when there are many pages to scrape
What is HTML and CSS?
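In short, HTML structures a page's content into nested, tagged elements, while CSS styles those elements through selectors (tag names, classes, ids); a scraper reuses those same selectors to locate content. A minimal sketch with rvest, using a made-up HTML snippet (the class name "speech" is invented for illustration):

```r
library(rvest)

# A minimal HTML page: tags structure the content, and the class
# attribute ("speech") is what a CSS selector can target
html <- '<html><body>
  <h1>Speeches</h1>
  <p class="speech">Remarks on inflation</p>
  <p class="speech">Remarks on growth</p>
</body></html>'

page <- read_html(html)

# ".speech" is a CSS selector: it matches every element with class "speech"
page %>% html_elements(".speech") %>% html_text()
#> [1] "Remarks on inflation" "Remarks on growth"
```

The same two functions, html_elements() and html_text(), are all you need for most of the extraction below; only the selector changes from site to site.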
API vs. web scraping
API (Application Programming Interface) provides a structured and predictable way to retrieve data from a service. It’s like ordering from a menu; you request specific data and receive it in a structured format
APIs exist for bibliometric data like Scopus, Web of Science, Google Scholar…
Web Scraping is the process of programmatically extracting data from the web page’s HTML itself. It’s akin to manually copying information from a book; you decide what information you need and how to extract it
API vs. web scraping
Control and Structure: APIs offer structured access to data, whereas web scraping requires parsing HTML and often cleaning the data yourself.
Ease of Use: Using an API can be simpler since it’s designed for data access (though not always). Scraping requires dealing with HTML changes and is more prone to breaking.
Availability: Not all websites offer an API, making web scraping a necessity in some cases.
Limitations and Authorization: APIs often have rate limits and may require authentication, but they grant authorized access to the data. Web scraping can bypass these limits but might violate terms of service.
Forget about big data, small data is everywhere!
A wide range of data you can collect:
official documents/speeches
agenda and meetings
lists of personnel or experts in commissions
laws or negotiations
To take into account how pages evolve over time, you can use the Internet Archive
Building databases
Involves a series of questions:
What’s your research question and which data would be appropriate to answer it?
How much data to collect?
Trade-off between collecting a lot of information (which requires more time) and risking missing some information at a later step
How to scrape the data? In which format?
Trade-off between extracting data properly in a first step and cleaning it in a second step
What do you lose by doing it automatically rather than manually? (or the reverse)
How to analyse/understand my new data?
How to update my database?
3 The Ethics of Web Scraping
Ethical considerations
Legal Considerations: Not all data is free to scrape. Websites’ terms of service may explicitly forbid web scraping, and in some jurisdictions, scraping can have legal implications
What is “forbidden” by a website is not necessarily “illegal”
Privacy Concerns: Scraping personal data can raise significant privacy issues and may be subject to regulations like GDPR in Europe
Website Performance: Scraping, especially if aggressive (e.g., making too many requests in a short period), can negatively impact the performance of a website, affecting its usability for others
session <- polite::bow(bis_website_path, user_agent = "polite R package - used for academic training by Aurélien Goutsmedt (aurelien.goutsmedt[at]uclouvain.be)")
field useragent value
1 Sitemap * https://www.bis.org/sitemap.xml
Using sitemap
# This function goes to a sitemap page, and extracts all the urls found
extract_url_from_sitemap <- function(url, delay = 1) {
  urls <- read_html(url) %>%
    html_elements(xpath = ".//loc") %>%
    html_text()
  Sys.sleep(delay) # You set a delay to avoid overloading the website
  return(urls)
}

# insistently allows to retry when you did not succeed in loading the page
insistently_extract_url <- insistently(extract_url_from_sitemap, rate = rate_backoff(max_times = 5))

document_pages <- extract_url_from_sitemap(session$robotstxt$sitemap$value) %>%
  .[str_detect(., "documents")] # We keep only the URLs for documents

bis_pages <- map(document_pages[1:5], # showing the code just on the first five years
                 ~ insistently_extract_url(url = ., delay = session$delay))

bis_pages <- tibble(year = str_extract(document_pages[1:5], "\\d{4}"),
                    urls = bis_pages) %>%
  unnest(urls)
[1] "Speech by Dr Joachim Nagel, President of the Deutsche Bundesbank, at Harvard University, Cambridge, 22 October 2024."
page %>% html_elements(".Reden") %>% html_text()
[1] "Ladies and gentlemen,"
[2] "it is a great pleasure to be at Harvard again, to meet long time companions like Hans-Helmut Kotz and to exchange ideas with top scientists such as Benjamin Friedman. When I was in this round two years ago, we were dealing with an unprecedented global inflation spike. Fortunately, the worst is behind us, and inflation in the euro area is heading back to the Eurosystem's target. We have not brought the inflation ship safely back into the 2% harbour, but the port is in sight. Thus, I can focus on another question today."
[3] "Before I do that, let me share an analogy to set the stage for my discussion. Back in the 1970s and 1980s, the field of economics was split into two seemingly incompatible schools of thought: New Keynesian and New Classical. Their proponents were not too polite in their language, calling assumptions \"foolishly restrictive\" or comparing an opponent to someone attempting to pass himself off as Napoleon Bonaparte. But, over time, ideas from both camps ultimately merged to form a consensus called the New Neoclassical Synthesis, the very foundation of modern macroeconomics. Gregory Mankiw neatly described this story in his essay \"The Macroeconomist as Scientist and Engineer\"."
[4] "The takeaway from this analogy is that complex issues are rarely black or white. With this in mind, I want to explore whether the conduct of monetary policy in the euro area could be enhanced by offering more detailed and nuanced information regarding its future outlook. More specifically, today I will address the following question: Should the Eurosystem introduce dot plots?"
page <- 2
day <- "01"
month <- "10"
year <- 2024 # we want to look at all the speeches since October 1st 2024

url_second_page <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page={page}&cbspeeches_page_length=25")
print(url_second_page)
# Launch Selenium to go on the website of the BIS
driver <- rsDriver(browser = "firefox", # can also be "chrome"
                   chromever = NULL,
                   port = 4444L)
remote_driver <- driver[["client"]]
starting_url <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page=1&cbspeeches_page_length=25")
remote_driver$navigate(starting_url)

# Extract the total number of pages
nb_pages <- remote_driver$findElement("css selector", ".pageof")$getElementText()[[1]] %>%
  str_remove_all("Page 1 of ") %>%
  as.integer()

# creating a list object to allocate information progressively
metadata <- vector(mode = "list", length = nb_pages)

for (page in 1:nb_pages) {
  url <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page={page}&cbspeeches_page_length=25")
  remote_driver$navigate(url)
  nod <- nod(session, url) # introducing ourselves politely to the new page
  Sys.sleep(session$delay) # using the delay time set by polite

  metadata[[page]] <- tibble(
    date = remote_driver$findElements("css selector", ".item_date") %>%
      map_chr(., ~ .$getElementText()[[1]]),
    info = remote_driver$findElements("css selector", ".item_date+ td") %>%
      map_chr(., ~ .$getElementText()[[1]]),
    url = remote_driver$findElements("css selector", ".dark") %>%
      map_chr(., ~ .$getElementAttribute("href")[[1]])
  )
}

metadata <- bind_rows(metadata) %>%
  separate(info, c("title", "description", "speaker"), "\n")

driver$server$stop() # we close the bot once we've finished
You want to scrape and analyze results of the 2024 UK General Election from the BBC website.
Objectives:
Count the number of parties with at least one seat.
Determine which parties gained or lost seats compared with the previous election.
Calculate and visualize the average votes per seat for each party with at least one seat.
Compare findings for the entire UK vs. England.
Easy exercise
Approach
Use the rvest and polite packages to retrieve data from the BBC website for party names, seats, votes, and seat changes for all parties in the UK.
Organize the data into a dataframe and clean it: convert seat and vote counts to numeric and remove extraneous symbols.
Analysis:
Count number of parties with at least one seat.
Order the parties according to seat gains/losses.
Calculate votes per seat for each party with at least one seat.
Plot the number of votes per seat for all parties with at least one seat.
Repeat the process for parties in England and then compare the results between the UK and England.
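As a sketch of the analysis step, here is what the votes-per-seat computation and plot could look like; the party names and figures below are invented placeholders, the real values come from the scraped BBC page:

```r
library(tidyverse)

# Invented placeholder data: the real values come from scraping the BBC page
results <- tibble(
  party = c("Party A", "Party B", "Party C", "Party D"),
  seats = c(410, 120, 70, 0),
  votes = c(9700000, 6800000, 3500000, 600000)
)

# Keep parties with at least one seat and compute votes per seat
results_with_seats <- results %>%
  filter(seats >= 1) %>%
  mutate(votes_per_seat = votes / seats)

nrow(results_with_seats) # number of parties with at least one seat

# Bar chart of votes per seat, ordered from lowest to highest
ggplot(results_with_seats, aes(x = reorder(party, votes_per_seat), y = votes_per_seat)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Votes per seat")
```

The same pipeline can then be rerun on the England-only table to make the UK vs. England comparison.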
Medium exercise
You want to know what happened to the files which were EU legislative priorities in 2023-2024.
1. Scrape the basic information
We are going to list all relevant procedures. In the EU, once proposed, each piece of legislation has a procedure number, including ‘COD’. Go to this page, which lists the legislative files that were priorities for 2023-24: https://oeil.secure.europarl.europa.eu/oeil/popups/thematicnote.do?id=41380&l=en
You have to scrape this page to obtain a data frame containing the title, the number, and the URL of the specific page of each procedure.
Check one or two links: can you copy-paste them into a browser and access the page? Is there anything missing in the URL? How could you fix this? Search manually for a procedure here to find out how URLs are built: https://oeil.secure.europarl.europa.eu/oeil/search/search.do?searchTab=y
Tip: you can use the paste function
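For instance, if the hrefs you scraped turn out to be relative paths (a common situation, and an assumption in this sketch), paste0() can prepend the missing domain:

```r
# Hypothetical relative link, as it might appear in the page's hrefs
relative_links <- c("/oeil/popups/ficheprocedure.do?reference=2021/0433(CNS)&l=en")

# paste0() concatenates without a separator, rebuilding the full URL
full_links <- paste0("https://oeil.secure.europarl.europa.eu", relative_links)
full_links
```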
Medium exercise part 2
2. Filter only the procedures of interest
Now you have listed the names of all relevant procedures, and the links to access them. You are only interested in procedures having ‘COD’ in their number. Create a data frame that contains only procedures with ‘COD’ in their number.
Tip: you can use the str_detect() function from stringr.
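The filtering step can be sketched as follows; the procedure numbers and URLs below are hypothetical placeholders for what you scraped at step 1:

```r
library(tidyverse)

# Hypothetical data frame of procedures; the real one comes from step 1
procedures <- tibble(
  number = c("2021/0433(CNS)", "2021/0342(COD)", "2022/0051(COD)"),
  url = c("url_1", "url_2", "url_3")
)

# str_detect() returns TRUE when the pattern appears in the string,
# so filter() keeps only procedures whose number contains "COD"
cod_procedures <- procedures %>%
  filter(str_detect(number, "COD"))

nrow(cod_procedures)
#> [1] 2
```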
Medium exercise part 3
3. Scrape a single page
Take this single URL link: https://oeil.secure.europarl.europa.eu/oeil/popups/ficheprocedure.do?reference=2021/0433(CNS)&l=en
It is one of the ones you have listed. In a separate data frame (which will have only one line, and three columns), scrape:
- the status of the procedure (i.e. at which stage it is)
- the date at which the legislative file was published
- the date at which the EP took its decision
3.a. Status of the procedure (observe: here the css selector is very “human readable”)
3.b. Date of publication of the legislative proposal. Tip: first, select all the dates. Then, select the names of all the events to which they correspond. Finally, select your event of interest with grepl (use for example “proposal”)
3.c. Date of EP decision
3.d. Put everything in a tibble
Medium exercise part 4
Writing a function
Write a function that automates the scraping you did at question 3 (generalize your code!). For each URL, the function has to scrape the same three pieces of information. Run that function and store the results in a data frame that also contains the numbers of the procedures and their URLs.
4.a Write the function. You can find explanations about creating a function in R here: https://www.r-bloggers.com/2022/04/how-to-create-your-own-functions-in-r/
Tip: in the function, you can use the tibble() function to bind the different pieces of information together. At the end, write return(created_data_frame): this indicates to R that it is the output of the function.
Tip: some of the information you are looking for may not be on all pages. Use the function length() to check whether your code found something, and write “To check” if the information is not found. Why is some info missing on some pages?
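Putting the two tips together, the function could have the following shape; the CSS selectors below are placeholders (replace them with the ones you identified at question 3):

```r
library(tidyverse)
library(rvest)

# Sketch of the scraping function; ".status-selector", ".date-selector"
# and ".decision-selector" are placeholder CSS selectors
scrape_procedure <- function(url) {
  page <- read_html(url)

  status <- page %>% html_elements(".status-selector") %>% html_text()
  proposal_date <- page %>% html_elements(".date-selector") %>% html_text()
  ep_decision_date <- page %>% html_elements(".decision-selector") %>% html_text()

  # If a piece of information is missing on the page, flag it for manual checking
  if (length(ep_decision_date) == 0) ep_decision_date <- "To check"

  created_data_frame <- tibble(
    status = status,
    proposal_date = proposal_date,
    ep_decision_date = ep_decision_date
  )
  return(created_data_frame) # the output of the function
}
```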
4.b Test the function. To do this, run the function on one of the links (only one!)
4.c Run the function. Use as input the list of URLs that you made previously. You will need the lapply() function to apply your function to this list of URLs. We can test this on the first ten links. Once you have run the function, you need to bind the results together vertically (i.e. stacking them on top of each other); otherwise, you just have a list of separate data frames, one for each procedure. Which R function allows you to do this? You can use bind_rows() from dplyr to aggregate all the tibbles.
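The lapply() + bind_rows() pattern can be sketched offline with a toy stand-in for the scraping function (fake_scrape and the URLs below are invented for illustration):

```r
library(tidyverse)

# Toy stand-in for your scraping function, so the pattern runs offline
fake_scrape <- function(url) {
  tibble(url = url, status = "Procedure completed")
}

urls <- c("url_1", "url_2", "url_3")

# lapply() applies the function to each URL and returns a list of tibbles
results_list <- lapply(urls, fake_scrape)

# bind_rows() stacks the tibbles on top of each other into one data frame
results <- bind_rows(results_list)
nrow(results)
#> [1] 3
```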
4.d Bind the results of your scraping with your original dataframe containing the links. Tip: we can do a bind_cols() because here we are sure that our input links (and procedures) are in the same order as the results. Otherwise, more generally, it is preferable to have a common identifier in each table and to use a join function.
Medium exercise part 5
Explore the data
5.a Calculate the duration of each legislative process (in days) in a new column of your data frame: the number of days between the legislative proposal and the EP decision. Tip: you have to tell R that you are working with dates. Search for the function that allows you to do this!
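A minimal sketch with base R, assuming the scraped dates are “day/month/year” strings (check the actual format on the pages; the two dates below are hypothetical):

```r
# Hypothetical date strings; the real ones come from your scraped columns
proposal_date <- as.Date("14/12/2021", format = "%d/%m/%Y")
ep_decision_date <- as.Date("01/03/2024", format = "%d/%m/%Y")

# Subtracting two Date objects gives a difftime, here converted to days
duration <- as.numeric(ep_decision_date - proposal_date)
duration
```

Inside a data frame, the same two steps (as.Date() on both columns, then subtraction) can be done in a single mutate().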
5.b What happens to the cases where the date of EP decision is not yet available? Pay attention when calculating the duration!
5.c Let’s look for the longest process. When did that procedure start?